
82 research outputs found

FluentEditor: Text-based Speech Editing by Considering Acoustic and Prosody Consistency

Authors: Jiang Ziyue, Li Haizhou, Liu Rui, Xi Jiatian
Publication date: 21/09/2023

Text-based speech editing (TSE) techniques enable users to edit the output audio by modifying the input text transcript instead of the audio itself. Despite much progress in neural network-based TSE, current techniques focus on reducing the difference between the generated speech segment and the reference target in the editing region, ignoring the segment's local and global fluency within its context and the original utterance. To maintain speech fluency, we propose a fluent speech editing model, termed FluentEditor, which incorporates fluency-aware training criteria into TSE training. Specifically, the acoustic consistency constraint aims to make the transition between the edited region and its neighboring acoustic segments smooth and consistent with the ground truth, while the prosody consistency constraint seeks to ensure that the prosody attributes within the edited region remain consistent with the overall style of the original utterance. Subjective and objective experimental results on VCTK demonstrate that FluentEditor outperforms all advanced baselines in terms of naturalness and fluency. The audio samples and code are available at https://github.com/Ai-S2-Lab/FluentEditor.

Comment: Submitted to ICASSP'202

arXiv.org e-Print Archive
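
The FluentEditor abstract above describes an acoustic consistency constraint that keeps the transition at the edges of the edited region consistent with the ground truth. The following is a minimal sketch of one way such a boundary penalty could be written, not the authors' formulation: the function names, the use of plain lists of mel frames, and the choice of mean squared error on frame-to-frame differences are all assumptions made for illustration.

```python
from typing import List, Sequence

Frame = Sequence[float]  # one spectrogram frame (hypothetical representation)


def frame_delta(a: Frame, b: Frame) -> List[float]:
    """Element-wise change from frame a to the adjacent frame b."""
    return [y - x for x, y in zip(a, b)]


def boundary_consistency_loss(
    pred: Sequence[Frame], truth: Sequence[Frame], start: int, end: int
) -> float:
    """Illustrative boundary penalty (an assumption, not the paper's loss).

    Compares the predicted and ground-truth frame-to-frame transitions at the
    two boundaries of the edited region [start, end) and returns their mean
    squared error.
    """
    if len(pred) != len(truth):
        raise ValueError("pred and truth must have the same number of frames")
    if not 0 < start < end < len(pred):
        raise ValueError("edited region needs context frames on both sides")

    total = 0.0
    count = 0
    # Left boundary: context frame -> first edited frame.
    # Right boundary: last edited frame -> context frame.
    for left, right in ((start - 1, start), (end - 1, end)):
        delta_pred = frame_delta(pred[left], pred[right])
        delta_true = frame_delta(truth[left], truth[right])
        total += sum((p - t) ** 2 for p, t in zip(delta_pred, delta_true))
        count += len(delta_pred)
    return total / count


truth = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
abrupt = [[0.0, 0.0], [3.0, 3.0], [2.0, 2.0], [3.0, 3.0]]
smooth_loss = boundary_consistency_loss(truth, truth, start=1, end=3)
abrupt_loss = boundary_consistency_loss(abrupt, truth, start=1, end=3)
```

In this toy input the edited region covers frames 1 and 2. A prediction identical to the ground truth gives a zero penalty, while the abrupt jump at the left boundary is penalized.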

When Online Auction Meets Virtual Reality: An Empirical Investigation

Authors: Guo Xingyao, Jiang Ziyue, Yan Zhenbin, Zhang Yi, Zhou Zhongyun
Publication venue: AIS Electronic Library (AISeL)
Publication date: 25/01/2024

Online auctions, in which a product is sold to the buyer with the highest bid, are becoming increasingly popular in e-commerce. However, the lack of authentic product details for a thorough evaluation still poses challenges to their success. Recently, virtual reality (VR) has been introduced to online auctions. We employ a unique dataset to investigate the effects of VR on auction outcomes and bidding activities. Results show that VR intensifies bidding competition among buyers, which in turn increases auction success and price, resulting in a competitive effect. Additionally, we find that VR boosts buyers’ strategic responses to the bidding war, leading to a late-bidding effect. The findings contribute to both the theory and practice of VR and online auctions in selling houses.

AIS Electronic Library (AISeL)

FluentSpeech: Stutter-Oriented Automatic Speech Editing with Context-Aware Diffusion Models

Authors: Huang Rongjie, Jiang Ziyue, Ren Yi, Yang Qian, Ye Zhenhui, Zhao Zhou, Zuo Jialong
Publication date: 22/05/2023

Stutter removal is an essential scenario in the field of speech editing. However, when the speech recording contains stutters, existing text-based speech editing approaches still suffer from: 1) the over-smoothing problem in the edited speech; 2) a lack of robustness due to the noise introduced by stutters; 3) the need for users to manually determine the region to edit in order to remove the stutters. To tackle these challenges in stutter removal, we propose FluentSpeech, a stutter-oriented automatic speech editing model. Specifically, 1) we propose a context-aware diffusion model that iteratively refines the modified mel-spectrogram with the guidance of context features; 2) we introduce a stutter predictor module to inject the stutter information into the hidden sequence; 3) we also propose a stutter-oriented automatic speech editing (SASE) dataset that contains spontaneous speech recordings with time-aligned stutter labels to train the automatic stutter localization model. Experimental results on the VCTK and LibriTTS datasets demonstrate that our model achieves state-of-the-art performance on speech editing. Further experiments on our SASE dataset show that FluentSpeech can effectively improve the fluency of stuttering speech in terms of objective and subjective metrics. Code and audio samples can be found at https://github.com/Zain-Jiang/Speech-Editing-Toolkit.

Comment: Accepted by ACL 2023 (Findings

arXiv.org e-Print Archive

Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech

Authors: Jiang Ziyue, Liu Jinglin, Ren Yi, Yang Qian, Ye Zhenhui, Zhao Zhou, Zhe Su
Publication date: 09/10/2022

Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable text-to-speech (TTS) systems. However, previous approaches require substantial annotated training data and additional effort from language experts, making it difficult to extend high-quality neural TTS systems to out-of-domain daily conversations and countless languages worldwide. This paper tackles the polyphone disambiguation problem from a concise and novel perspective: we propose Dict-TTS, a semantic-aware generative text-to-speech model with an online website dictionary (the existing prior information in the natural language). Specifically, we design a semantics-to-pronunciation attention (S2PA) module to match the semantic patterns between the input text sequence and the prior semantics in the dictionary and obtain the corresponding pronunciations. The S2PA module can be easily trained with the end-to-end TTS model without any annotated phoneme labels. Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy and improves the prosody modeling of TTS systems. Further extensive analyses demonstrate that each design in Dict-TTS is effective. The code is available at https://github.com/Zain-Jiang/Dict-TTS.

Comment: Accepted by NeurIPS 202

arXiv.org e-Print Archive
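
The Dict-TTS abstract above describes an S2PA module that matches the semantics of the input text against dictionary entries to obtain a pronunciation. As a rough illustration of that matching idea only, the sketch below uses generic dot-product attention over a handful of dictionary entries. It is not the authors' module: the function names, the vector shapes, and the toy embeddings are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def attend_pronunciation(
    query: Sequence[float],
    entry_semantics: Sequence[Sequence[float]],
    entry_pronunciations: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Generic dot-product attention (illustrative, not the paper's S2PA).

    query: context embedding of the ambiguous character (assumed given).
    entry_semantics: one semantic embedding per dictionary entry (keys).
    entry_pronunciations: one pronunciation embedding per entry (values).
    Returns the attention weights and the weighted pronunciation embedding.
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key in entry_semantics]
    weights = softmax(scores)
    dim = len(entry_pronunciations[0])
    mixed = [
        sum(w * value[i] for w, value in zip(weights, entry_pronunciations))
        for i in range(dim)
    ]
    return weights, mixed


# Toy example: the context matches the first dictionary sense much better.
weights, mixed = attend_pronunciation(
    query=[1.0, 0.0],
    entry_semantics=[[5.0, 0.0], [0.0, 5.0]],
    entry_pronunciations=[[1.0, 0.0], [0.0, 1.0]],
)
```

Because the weights come from a softmax, the lookup is differentiable, which is one plausible reason such a module could be trained end to end without phoneme labels, as the abstract states.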

Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis

Authors: Jiang Ziyue, Liu Jinglin, Ma Zejun, Ren Yi, Ye Zhenhui, Yin Xiang, Zhang Chen, Zhao Zhou
Publication date: 06/06/2023

We are interested in a novel task, namely low-resource text-to-talking avatar. Given only a few-minute-long video of a talking person with its audio track as the training data and arbitrary texts as the driving input, we aim to synthesize high-quality talking portrait videos corresponding to the input text. This task has broad application prospects in the digital human industry but has not yet been technically achieved due to two challenges: (1) it is challenging for a traditional multi-speaker text-to-speech system to mimic the timbre of out-of-domain audio; (2) it is hard to render high-fidelity and lip-synchronized talking avatars with limited training data. In this paper, we introduce Adaptive Text-to-Talking Avatar (Ada-TTA), which (1) designs a generic zero-shot multi-speaker TTS model that disentangles the text content, timbre, and prosody well; and (2) embraces recent advances in neural rendering to achieve realistic audio-driven talking face video generation. With these designs, our method overcomes the two challenges above and generates identity-preserving speech and realistic talking person video. Experiments demonstrate that our method can synthesize realistic, identity-preserving, and audio-visually synchronized talking avatar videos.

Comment: 6 pages, 3 figure

arXiv.org e-Print Archive